Measurement of dispersion of PM 2.5 in Thailand using confidence intervals for the coefficient of variation of an inverse Gaussian distribution

Air pollution is a growing concern for the general public in Thailand, with PM 2.5 (particulate matter ≤ 2.5 µm) having the greatest impact on health. The inverse Gaussian (IG) distribution is used for examining the frequency of high-concentration events and has often been applied to analyze pollution data, with the coefficient of variation (CV) being used to quantify the variation in PM 2.5 concentrations. Herein, we propose confidence intervals for the CV of an IG distribution based on the generalized confidence interval (GCI), the adjusted generalized confidence interval (AGCI), the bootstrap percentile confidence interval (BPCI), the fiducial confidence interval (FCI), and the fiducial highest posterior density confidence interval (F-HPDCI). The performance of the proposed confidence intervals was evaluated in terms of their coverage probabilities and average lengths under various scenarios via Monte Carlo simulations. The simulation results indicate that the coverage probabilities of the AGCI and FCI methods were higher than or close to the nominal level in all of the test scenarios. Moreover, FCI outperformed the others for small sample sizes by achieving the shortest average length. The efficacies of the confidence intervals were demonstrated by using PM 2.5 data from the Din Daeng and Bang Khun Thian districts in Bangkok, Thailand.


INTRODUCTION
Air pollution is regarded as a serious environmental threat. Inefficient forms of transportation (polluting fuels and cars), coal-fired power plants, and agricultural and garbage burning are all major causes of air pollution, high levels of which have been linked with several negative health effects (Pope & Dockery, 2006). Both short- and long-term exposure increase the risk of respiratory infections, heart problems, and lung cancer; people who are already sick suffer more severe consequences, and children and the elderly are particularly vulnerable. One of the most health-damaging pollutants is PM 2.5 (particulate matter ≤ 2.5 µm), which can penetrate deep into the lungs.
The problem of air pollution is a growing concern for the general public in Thailand. In particular, Bangkok's urban environment is a complex mixture of commercial, residential, and industrial buildings. The growing number of vehicles on the road and increased energy consumption constitute a significant source of PM 2.5 emissions, and consequently, PM 2.5 levels (and population exposure) along roadsides are frequently substantially greater than in other areas. Industrial emissions are also of great concern in some areas. Chuersuwan et al. (2008) investigated PM 10 (particulate matter ≤ 10 µm) and PM 2.5 concentrations in the Bangkok Metropolitan Region over the course of a year and uncovered their key sources in the inner city adjacent to a busy road. Oanh et al. (2013) reported the mass concentrations of PM 2.5 and gaseous pollutants along specific travel routes and at fixed roadside locations in the same region. Since the coefficient of variation (CV) was used in these studies to quantitatively characterize the temporal variation of PM 2.5, we are interested in analyzing PM 2.5 data by using the CV of an inverse Gaussian (IG) distribution.
The inverse Gaussian (IG) distribution is an excellent choice for modeling positive and right-skewed data, and some of its statistical features were discussed by Tweedie (1957). The IG distribution has been used in hydrology, cardiology, pharmacokinetics, demography, economics, and finance, among other fields. Schrodinger (1915) used it to model the first passage time distribution for Brownian motion. Chhikara & Folks (1989) proposed its application to model lifetime and wind energy data. The IG distribution has also been used to investigate insulating fluid failure time, market incidence models, particle cycle time distribution in the blood, health costs, and air pollution. For example, Kumar & Goel (2016) calculated PM 10, PM 2.5, and PM 1 zones of influence under various driving situations and fitted probability density functions (pdfs) to fixed-site data for different PM types at signalized traffic junctions, for which the inverse Gaussian was found to be the most suitable. Gavril et al. (2006) used the pdf in the analysis of the distribution of air pollution in central Athens; they found that the inverse Gaussian distribution provided a better fit than the beta, gamma, and Weibull distributions. Therefore, we are interested in studying PM 2.5 pollution by using the IG distribution.
Numerous investigations on the parameters and confidence intervals for IG distributions have previously been conducted. For example, Arefi, Borzadaran & Vaghei (2008) proposed the likelihood ratio interval and the Wald interval for the mean of an IG distribution. Ye, Ma & Wang (2010) employed the generalized confidence interval (GCI) approach for hypothesis testing and interval estimation for the common mean of several IG populations. Tian & Wilding (2005) presented confidence intervals for the ratio of means of two independent IG populations using modified directed likelihood ratio statistics. Krishnamoorthy & Tian (2008) proposed a method for constructing confidence intervals for IG distributions based on the generalized variable approach and a modified likelihood ratio test. In another study, Ismail & Auda (2007) used Bayesian and fiducial inference via the Gibbs sampling process for IG distributions with Type-II censored data. Jayalath & Chhikara (2020) developed a thorough survival analysis for an IG distribution via the Gibbs sampling method by employing Bayesian and fiducial approaches that require a Markov chain Monte Carlo (MCMC) method.
The CV, which is the ratio of the standard deviation to the mean and is therefore useful for comparing the dispersion of data measured in different units, has been used in a variety of fields, including agriculture, biology, medicine, and finance, and many researchers have developed estimators and confidence intervals for it. Banik & Kibria (2011) presented confidence intervals for estimating the CVs of symmetric and positively skewed distributions. Wongkhao, Niwitpong & Niwitpong (2015) suggested confidence intervals for the CV of a normal distribution. Sangnawakij & Niwitpong (2017) used three approaches, namely the method of variance of estimates recovery (MOVER), GCI, and the asymptotic confidence interval, to establish confidence intervals for the CV of a two-parameter exponential distribution. Niwitpong (2013) created a new confidence interval for the CV of a lognormal distribution with restricted parameters. Yosboonruang, Niwitpong & Niwitpong (2019) provided confidence intervals for the CV of a lognormal distribution by utilizing Bayesian and fiducial GCI methods. Confidence intervals for the difference between the CVs of Weibull distributions were proposed by La-ongkaew, Niwitpong & Niwitpong (2021). Chankham, Niwitpong & Niwitpong (2019) used GCI and bootstrap percentile confidence interval (BPCI) methods to calculate confidence intervals for the CV of an IG distribution.
The purpose of the current study is to establish new confidence intervals for the CV of an IG distribution by using generalized confidence interval (GCI), adjusted generalized confidence interval (AGCI), bootstrap percentile confidence interval (BPCI), fiducial confidence interval (FCI), and fiducial highest posterior density confidence interval (F-HPDCI) methods. The paper is organized as follows. The proposed methods are described in the Methods section. The simulation results are presented in the Results section, and the proposed methods are then applied to real-world datasets in the section An empirical application. The last two sections contain the discussion and conclusions of the study.

METHODS
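Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from an IG distribution denoted as IG(µ, λ), where µ and λ are the mean and scale parameters of X, respectively. The probability density function of X is given by

$$f(x;\mu,\lambda) = \sqrt{\frac{\lambda}{2\pi x^{3}}}\,\exp\left\{-\frac{\lambda (x-\mu)^{2}}{2\mu^{2}x}\right\}, \quad x > 0,\ \mu > 0,\ \lambda > 0.$$

The maximum likelihood estimates (MLEs) of parameters µ and λ are

$$\hat{\mu}_i = \bar{X}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij} \quad\text{and}\quad \hat{\lambda}_i^{-1} = \frac{1}{n_i}\sum_{j=1}^{n_i}\left(X_{ij}^{-1} - \bar{X}_i^{-1}\right),$$

respectively, where the subscript i labels the population and $X_{ij}$ is its j-th observation; for the single sample considered here, $n_i = n$. For notational convenience, we use $V_i = \hat{\lambda}_i^{-1}$. $\bar{X}_i$ and $V_i$ are mutually independent random variables with respective distributions

$$\bar{X}_i \sim IG(\mu_i, n_i\lambda_i) \quad\text{and}\quad n_i\lambda_i V_i \sim \chi^2_{n_i-1},$$

where $\chi^2_{n_i-1}$ denotes a chi-square distribution with $n_i - 1$ degrees of freedom. Reproducing the exponential property of the IG explains the independence of these two statistics. It is simple to show that $(\bar{X}_i, V_i)$ form a set of complete sufficient statistics. The population mean and variance of X can respectively be expressed as

$$E(X) = \mu \quad\text{and}\quad Var(X) = \frac{\mu^{3}}{\lambda}.$$

Therefore, the CV of X is

$$\theta = \frac{\sqrt{Var(X)}}{E(X)} = \sqrt{\frac{\mu}{\lambda}}.$$

Here, we present the five methods for constructing confidence intervals for θ.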
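As a minimal illustration of the point estimators defined above, the following Python sketch computes the MLEs of µ and λ and the plug-in estimate of the CV, θ = sqrt(µ/λ), for a single sample. The function name `ig_mle_cv` and the three-point toy sample are hypothetical and are not taken from the study.

```python
import math
from statistics import fmean

def ig_mle_cv(x):
    """Return (mu_hat, lambda_hat, cv_hat) for a sample x of positive values."""
    n = len(x)
    mu_hat = fmean(x)                                  # MLE of mu: the sample mean
    v = sum(1.0 / xi - 1.0 / mu_hat for xi in x) / n   # V = 1 / lambda_hat
    lambda_hat = 1.0 / v
    cv_hat = math.sqrt(mu_hat / lambda_hat)            # plug-in CV: sqrt(mu / lambda)
    return mu_hat, lambda_hat, cv_hat

# Toy data for illustration only (not the PM 2.5 measurements).
mu_hat, lambda_hat, cv_hat = ig_mle_cv([1.0, 2.0, 4.0])
```

For the toy sample, the estimates are µ̂ = 7/3 and λ̂ = 84/13, so the plug-in CV is sqrt(91/252).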

The GCI method
Since the GCI approach was first introduced by Weerahandi (1993), several researchers have used it to provide statistical inferences (Tsui & Weerahandi, 1989; Weerahandi, 1993; Weerahandi, 1995; Ye & Wang, 2007; Krishnamoorthy & Tian, 2008). Ye, Ma & Wang (2010) later presented the generalized pivotal quantity (GPQ) for the parameters and constructed confidence intervals for the common mean of several inverse Gaussian populations. Furthermore, Chankham, Niwitpong & Niwitpong (2019) recommended GCI for constructing confidence intervals for the coefficient of variation of an IG distribution; they found that GCI provided coverage probabilities greater than or equal to the nominal confidence level of 0.95. Therefore, GCI was selected as the baseline for comparison with the proposed methods of this study. The confidence interval for the CV is calculated using the concept of the GPQ. Let $X = (X_1, X_2, \ldots, X_n)$ be random variables from a distribution defined by probability density function $f_X(x;\theta,\delta)$, with θ and δ being the parameter of interest and the nuisance parameter, respectively. A GPQ $R(X;x,\theta,\delta)$ satisfies the following two conditions:
(i) The probability distribution of $R(X;x,\theta,\delta)$ is independent of the unknown parameters.
(ii) The observed value of $R(X;x,\theta,\delta)$ at $X = x$ does not depend on the nuisance parameters.
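If $R(X;x,\theta,\delta)$ satisfies both conditions, then the GCI for the parameter of interest is calculated using the percentiles of the GPQ. Let $[R_{\alpha/2}, R_{1-\alpha/2}]$ be a 100(1 − α)% two-sided GCI for the parameter of interest, where $R_{\alpha/2}$ and $R_{1-\alpha/2}$ denote the 100(α/2)-th and 100(1 − α/2)-th percentiles of R, respectively. Ye, Ma & Wang (2010) proposed the respective GPQs for $\lambda_i$ and $\mu_i$ as follows:

$$R_{\lambda_i} = \frac{n_i\lambda_i V_i}{n_i v_i} \sim \frac{\chi^2_{n_i-1}}{n_i v_i},$$

where $\chi^2_{n_i-1}$ denotes a chi-squared distribution with $n_i - 1$ degrees of freedom and $v_i$ is the observed value of $V_i$. Thus, the GPQ for $\mu_i$ is given by

$$R_{\mu_i} = \frac{\bar{x}_i}{\left|1 + \dfrac{\sqrt{n_i\lambda_i}\,(\bar{X}_i-\mu_i)}{\mu_i\sqrt{\bar{X}_i}}\sqrt{\dfrac{\bar{x}_i}{n_i R_{\lambda_i}}}\right|} \overset{d}{\sim} \frac{\bar{x}_i}{\left|1 + Z_i\sqrt{\dfrac{\bar{x}_i}{n_i R_{\lambda_i}}}\right|},$$

where $\overset{d}{\sim}$ means "is approximately distributed as", $\bar{x}_i$ is the observed value of $\bar{X}_i$, and $Z_i \sim N(0,1)$. The approximation derives from the moment matching method of Chhikara & Folks (1989), under which $\sqrt{n_i\lambda_i}\,(\bar{X}_i-\mu_i)/(\mu_i\sqrt{\bar{X}_i})$ has N(0,1) as its limiting distribution. Consequently, the observed value of $R_{\mu_i}$ is $\mu_i$. Therefore, the GPQ for the CV of an IG distribution is given by

$$R_{\theta} = \sqrt{\frac{R_{\mu_i}}{R_{\lambda_i}}}.$$

The 100(1 − α)% two-sided confidence interval for the CV of an IG distribution based on the GCI method is given by

$$\left[R_{\theta}(\alpha/2),\ R_{\theta}(1-\alpha/2)\right],$$

where $R_{\theta}(\alpha/2)$ and $R_{\theta}(1-\alpha/2)$ are the 100(α/2)-th and 100(1 − α/2)-th percentiles of the distribution of $R_{\theta}$, respectively.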
The following algorithm was used to construct the GCI:
Algorithm 1
(1) Generate $X_1, X_2, \ldots, X_n$ from an IG distribution.
(3) Generate χ 2 n−1 from a Chi-square distribution with n − 1 degrees of freedom and Z from a standard normal distribution.
(7) Repeat Steps 1-6, 15,000 times to compute the coverage probability and the average length.
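The GCI construction for one observed sample can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: it assumes the single-sample GPQs R_λ = χ²_{n−1}/(n v) and R_µ = x̄/|1 + Z sqrt(x̄/(n R_λ))|, with R_θ = sqrt(R_µ/R_λ), and it takes the interval from the empirical percentiles of 5,000 simulated values of R_θ. The function name `gci_cv`, the seed, and the toy sample are hypothetical.

```python
import math
import random

def gci_cv(x, alpha=0.05, n_gpq=5000, seed=1):
    """Approximate 100(1 - alpha)% GCI for the CV of an IG sample x."""
    rng = random.Random(seed)
    n = len(x)
    xbar = sum(x) / n
    v = sum(1.0 / xi - 1.0 / xbar for xi in x) / n   # observed V = 1 / lambda_hat
    r_theta = []
    for _ in range(n_gpq):
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)  # chi-square with n - 1 df
        z = rng.gauss(0.0, 1.0)                      # standard normal
        r_lam = chi2 / (n * v)                       # GPQ for lambda
        r_mu = xbar / abs(1.0 + z * math.sqrt(xbar / (n * r_lam)))  # GPQ for mu
        r_theta.append(math.sqrt(r_mu / r_lam))      # GPQ for the CV
    r_theta.sort()
    lower = r_theta[int((alpha / 2) * n_gpq)]
    upper = r_theta[int((1 - alpha / 2) * n_gpq) - 1]
    return lower, upper

# Toy data for illustration only (not the PM 2.5 measurements).
toy = [0.6, 1.1, 0.8, 2.3, 1.7, 0.9, 1.4, 0.5, 3.0, 1.2]
lower, upper = gci_cv(toy)
```

A chi-square variate with n − 1 degrees of freedom is drawn here as a gamma variate with shape (n − 1)/2 and scale 2, which keeps the sketch within the Python standard library.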

The AGCI method
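According to Ye, Ma & Wang (2010), an approach similar to the GCI method can be utilized for the single coefficient of variation. The GPQ of λ is the same as in the GCI method. Krishnamoorthy & Tian (2008) established an approximate GPQ of $\mu_i$ as follows:

$$R_{\tilde{\mu}_i} = \frac{\bar{x}_i}{\max\left\{0,\ 1 + t_{n_i-1}\sqrt{\dfrac{\bar{x}_i v_i}{n_i-1}}\right\}},$$

where $\bar{x}_i$ and $v_i$ are the observed values of $\bar{X}_i$ and $V_i$, and $t_{n_i-1}$ denotes a t-distribution with $n_i - 1$ degrees of freedom. However, the denominator can become zero when $t_{n_i-1}$ takes a negative value, and thus $R_{\tilde{\mu}_i}$ is only an approximate GPQ. The GPQ for the CV of an IG distribution based on the AGCI is therefore given by

$$R_{\tilde{\theta}} = \sqrt{\frac{R_{\tilde{\mu}_i}}{R_{\lambda_i}}}.$$

Subsequently, the 100(1 − α)% two-sided confidence interval for the CV of an IG distribution based on the AGCI method is given by

$$\left[R_{\tilde{\theta}}(\alpha/2),\ R_{\tilde{\theta}}(1-\alpha/2)\right],$$

where $R_{\tilde{\theta}}(\alpha/2)$ and $R_{\tilde{\theta}}(1-\alpha/2)$ are the 100(α/2)-th and 100(1 − α/2)-th percentiles of the distribution of $R_{\tilde{\theta}}$, respectively, which can be obtained by following the steps of Algorithm 1.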
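A hedged Python sketch of the AGCI for one observed sample is given below. It assumes the approximate GPQ x̄ / max{0, 1 + t_{n−1} sqrt(x̄ v/(n − 1))} for µ and the same GPQ for λ as in the GCI method. Two further choices are assumptions of this sketch and are not stated in the text: the t variate is drawn independently of the chi-square variate used for λ, and a zero denominator is mapped to an infinite GPQ value. The function name `agci_cv` and the toy sample are hypothetical.

```python
import math
import random

def agci_cv(x, alpha=0.05, n_gpq=5000, seed=1):
    """Approximate 100(1 - alpha)% AGCI for the CV of an IG sample x."""
    rng = random.Random(seed)
    n = len(x)
    xbar = sum(x) / n
    v = sum(1.0 / xi - 1.0 / xbar for xi in x) / n   # observed V = 1 / lambda_hat
    r_theta = []
    for _ in range(n_gpq):
        r_lam = rng.gammavariate((n - 1) / 2.0, 2.0) / (n * v)   # GPQ for lambda
        # Student t with n - 1 df, drawn independently of r_lam (assumption).
        t = rng.gauss(0.0, 1.0) / math.sqrt(
            rng.gammavariate((n - 1) / 2.0, 2.0) / (n - 1))
        denom = max(0.0, 1.0 + t * math.sqrt(xbar * v / (n - 1)))
        # A zero denominator gives an infinite GPQ value (assumption).
        r_mu = xbar / denom if denom > 0.0 else math.inf
        r_theta.append(math.sqrt(r_mu / r_lam))      # GPQ for the CV
    r_theta.sort()
    lower = r_theta[int((alpha / 2) * n_gpq)]
    upper = r_theta[int((1 - alpha / 2) * n_gpq) - 1]
    return lower, upper

# Toy data for illustration only (not the PM 2.5 measurements).
toy = [0.6, 1.1, 0.8, 2.3, 1.7, 0.9, 1.4, 0.5, 3.0, 1.2]
lower, upper = agci_cv(toy)
```

Because infinite values sort to the top of the list, they affect the upper limit only if they make up more than a fraction α/2 of the simulated values.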

The BPCI method
When applying this method, the distribution of the statistic over the bootstrap samples is used as a direct approximation of its sampling distribution (Efron & Tibshirani, 1986). The method proceeds by resampling the observed data with replacement. Chankham, Niwitpong & Niwitpong (2019) reported that BPCI performed more poorly than GCI. However, to provide context, this approach was still included in the comparative analysis. Suppose $X = (X_1, X_2, \ldots, X_n)$ is a random sample of size n from an IG distribution. A bootstrap sample $X^* = (X^*_1, X^*_2, \ldots, X^*_n)$ is drawn with replacement from X, and this is repeated B times. Efron & Tibshirani (1986) suggested that approximately B = 1,000 bootstrap resamples are usually sufficient for obtaining reasonably accurate confidence interval estimates, and this number was used here for the CV of an IG distribution.
The following algorithm is used to construct the BPCI:
Algorithm 2
(1) Generate $X_1, X_2, \ldots, X_n$ from an IG distribution.
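The bootstrap percentile idea can be sketched in Python as follows. The sketch resamples the observed data with replacement B = 1,000 times, recomputes the plug-in CV for each resample, and takes the empirical α/2 and 1 − α/2 percentiles. The names `cv_hat` and `bpci_cv`, the seed, and the toy sample are hypothetical, and the steps are a generic percentile bootstrap rather than a transcription of Algorithm 2.

```python
import math
import random

def cv_hat(x):
    """Plug-in CV of an IG sample: sqrt(mu_hat / lambda_hat) = sqrt(xbar * V)."""
    n = len(x)
    xbar = sum(x) / n
    v = sum(1.0 / xi - 1.0 / xbar for xi in x) / n
    return math.sqrt(max(xbar * v, 0.0))   # guard against tiny negative round-off

def bpci_cv(x, alpha=0.05, n_boot=1000, seed=1):
    """Bootstrap percentile interval for the CV of an IG sample x."""
    rng = random.Random(seed)
    n = len(x)
    # Each resample has size n and is drawn with replacement from x.
    stats = sorted(cv_hat(rng.choices(x, k=n)) for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy data for illustration only (not the PM 2.5 measurements).
toy = [0.6, 1.1, 0.8, 2.3, 1.7, 0.9, 1.4, 0.5, 3.0, 1.2]
lower, upper = bpci_cv(toy)
```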

The FCI method
Although fiducial inference proposed by Fisher (1973) is similar to the Bayesian framework, it does not require a prior distribution for the parameters being estimated. Fiducial inference is the only type of inference with frequentist interpretation that uses conditionality on the data. Hence, it allows the implementation of Gibbs sampling (Geman & Geman, 1984), which is an MCMC method commonly used to generate a sample from the posterior distribution for Bayesian inference by sweeping through each variable to sample from its conditional distribution while the remaining variables are fixed at their current values. The sampling distributions of the MLEs of both µ and λ are used for an IG distribution. The fiducial distributions of µ and λ are obtained simply by replacing the parameters with their MLEs where they appear in these sampling distributions, as follows:

$$\mu \sim IG(\hat{\mu}, n\hat{\lambda}) \quad\text{and}\quad \lambda \sim \frac{\hat{\lambda}}{n}\,\chi^2_{n-1},$$

where $\hat{\mu}$ and $\hat{\lambda}$ are the MLEs of µ and λ.
(5) Discard the first 1,000 samples as burn-in and compute the parameter of interest.
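A simplified Python sketch of the fiducial interval is given below. It is not Algorithm 3: instead of a Gibbs sampler, it assumes the fiducial distributions µ ~ IG(µ̂, nλ̂) and λ ~ (λ̂/n)χ²_{n−1} and draws from them directly and independently. The first 1,000 draws are discarded only to mirror the burn-in step, since independent draws need no burn-in. The IG variates are generated with the Michael-Schucany-Haas transformation, and the names `rinvgauss` and `fci_cv`, the seed, and the toy sample are hypothetical.

```python
import math
import random

def rinvgauss(rng, mu, lam):
    """One IG(mu, lam) draw via the Michael-Schucany-Haas transformation."""
    y = rng.gauss(0.0, 1.0) ** 2
    x = mu + mu * mu * y / (2 * lam) - (mu / (2 * lam)) * math.sqrt(
        4 * mu * lam * y + (mu * y) ** 2)
    return x if rng.random() <= mu / (mu + x) else mu * mu / x

def fci_cv(x, alpha=0.05, n_draws=20000, burn_in=1000, seed=1):
    """Equal-tailed fiducial interval for the CV of an IG sample x."""
    rng = random.Random(seed)
    n = len(x)
    mu_hat = sum(x) / n
    lam_hat = 1.0 / (sum(1.0 / xi - 1.0 / mu_hat for xi in x) / n)
    theta = []
    for _ in range(n_draws):
        mu = rinvgauss(rng, mu_hat, n * lam_hat)   # mu ~ IG(mu_hat, n * lam_hat)
        # lambda ~ (lam_hat / n) * chi-square with n - 1 df
        lam = (lam_hat / n) * rng.gammavariate((n - 1) / 2.0, 2.0)
        theta.append(math.sqrt(mu / lam))          # fiducial draw of the CV
    kept = sorted(theta[burn_in:])                 # drop the burn-in draws
    m = len(kept)
    return kept[int((alpha / 2) * m)], kept[int((1 - alpha / 2) * m) - 1]

# Toy data for illustration only (not the PM 2.5 measurements).
toy = [0.6, 1.1, 0.8, 2.3, 1.7, 0.9, 1.4, 0.5, 3.0, 1.2]
lower, upper = fci_cv(toy)
```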

The F-HPDCI method
Here, the HPD credible intervals of the parameters are obtained by using the MCMC method (Chen & Shao, 1999). Each value inside an HPD interval has a higher posterior density than any of the values outside of it (Box & Tiao, 2011). Hence, the 100(1 − α)% two-sided confidence interval for the CV of an IG distribution based on the F-HPDCI is the HPD interval of the fiducial draws of θ. The following algorithm is used to construct the confidence interval using the F-HPDCI method for the CV of an IG distribution:
Algorithm 4
(1) Take the initial values (MLEs) of the parameters $(\mu^{(0)}, \lambda^{(0)})$.
(2) Use Algorithm 3 to calculate the parameter of interest.
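One common way to estimate an HPD interval from simulated draws is to take the shortest interval that contains a 1 − α share of the sorted draws, which is appropriate when the distribution is unimodal. The Python sketch below implements that idea. The function name `hpd_interval` and the skewed toy draws are hypothetical, and the study's own implementation may differ.

```python
import math
import random

def hpd_interval(draws, alpha=0.05):
    """Shortest interval containing at least a (1 - alpha) share of the draws."""
    s = sorted(draws)
    m = len(s)
    k = math.ceil((1 - alpha) * m)     # number of draws the interval must hold
    # Width of every window of k consecutive order statistics.
    widths = [s[i + k - 1] - s[i] for i in range(m - k + 1)]
    i_best = widths.index(min(widths))
    return s[i_best], s[i_best + k - 1]

# Right-skewed toy draws for illustration only (not fiducial draws of the CV).
rng = random.Random(0)
draws = [rng.expovariate(1.0) for _ in range(2000)]
lower, upper = hpd_interval(draws)
```

For a right-skewed sample such as this one, the HPD interval sits closer to the mode and is no longer than the equal-tailed interval with the same content.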

RESULTS
A Monte Carlo simulation study using the R statistical programming language was conducted to evaluate the performances of the confidence intervals based on GCI, AGCI, BPCI, FCI, and F-HPDCI for the CV of an IG distribution. The sample size was set as n = 5, 10, 30, 50, 100, and 200; µ as 0.5 and 1; and λ as 1, 2, 5, and 10. We used 15,000 replications for each parameter combination. Furthermore, 5,000 repetitions were used for the GCI and AGCI methods, 1,000 bootstrap samples for the BPCI method, and 20,000 realizations of MCMC using the Gibbs algorithm with a burn-in of 1,000 for the fiducial methods. The performances of the confidence intervals of the five methods were assessed in terms of their coverage probabilities (CP) and average lengths (AL), respectively calculated as

$$CP = \frac{c(L \le \theta \le U)}{M} \quad\text{and}\quad AL = \frac{\sum (U - L)}{M},$$

where $c(L \le \theta \le U)$ is the number of simulation replications in which θ lies within the confidence interval [L, U], the sum runs over all replications, and M is the number of simulation replications. The best-performing confidence interval in each case had a coverage probability greater than or equal to the nominal confidence level of 0.95 and the shortest average length. The coverage probabilities and average lengths of all of the methods were computed by using Algorithm 5.
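The evaluation criteria can be illustrated with a small Python sketch that estimates the coverage probability and average length of one interval method by simulation. It is deliberately scaled down (200 replications and 500 GPQ draws instead of 15,000 and 5,000) so that it runs quickly, and it uses a GCI based on the assumed GPQs R_λ = χ²_{n−1}/(n v) and R_µ = x̄/|1 + Z sqrt(x̄/(n R_λ))|. The helper names, the seed, and the chosen setting (n = 30, µ = 1, λ = 5) are illustrative assumptions, not a reproduction of Algorithm 5.

```python
import math
import random

def rinvgauss(rng, mu, lam):
    """One IG(mu, lam) draw via the Michael-Schucany-Haas transformation."""
    y = rng.gauss(0.0, 1.0) ** 2
    x = mu + mu * mu * y / (2 * lam) - (mu / (2 * lam)) * math.sqrt(
        4 * mu * lam * y + (mu * y) ** 2)
    return x if rng.random() <= mu / (mu + x) else mu * mu / x

def gci_cv(x, rng, alpha=0.05, n_gpq=500):
    """GCI for the CV from empirical percentiles of simulated GPQ values."""
    n = len(x)
    xbar = sum(x) / n
    v = sum(1.0 / xi - 1.0 / xbar for xi in x) / n
    r_theta = []
    for _ in range(n_gpq):
        r_lam = rng.gammavariate((n - 1) / 2.0, 2.0) / (n * v)
        z = rng.gauss(0.0, 1.0)
        r_mu = xbar / abs(1.0 + z * math.sqrt(xbar / (n * r_lam)))
        r_theta.append(math.sqrt(r_mu / r_lam))
    r_theta.sort()
    return r_theta[int((alpha / 2) * n_gpq)], r_theta[int((1 - alpha / 2) * n_gpq) - 1]

def coverage_and_length(ci_method, n, mu, lam, n_rep=200, seed=1):
    """Return (CP, AL) for ci_method over n_rep simulated IG(mu, lam) samples."""
    rng = random.Random(seed)
    theta = math.sqrt(mu / lam)            # true CV
    hits, total_length = 0, 0.0
    for _ in range(n_rep):
        x = [rinvgauss(rng, mu, lam) for _ in range(n)]
        lower, upper = ci_method(x, rng)
        hits += lower <= theta <= upper    # counts 1 when the interval covers theta
        total_length += upper - lower
    return hits / n_rep, total_length / n_rep

cp, al = coverage_and_length(gci_cv, n=30, mu=1.0, lam=5.0)
```

With so few replications the estimated coverage is noisy; the replication counts reported above would be needed to compare methods reliably.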

AN EMPIRICAL APPLICATION
Example 1 PM 2.5 data from the Din Daeng district of Bangkok were collected by the Pollution Control Department, Thailand (http://air4thai.pcd.go.th/webV2/download.php); this area was chosen because of its high traffic volume (Table 2). Figure 3 exhibits a Q-Q plot of the PM 2.5 data, indicating that an IG distribution is suitable for this dataset. Before computing the confidence intervals, the minimum Akaike information criterion (AIC) and Bayesian information criterion (BIC) were first used to identify the best-fitting distribution for these data. These two criteria are respectively defined as

$$AIC = -2\ln L + 2k \quad\text{and}\quad BIC = -2\ln L + k\ln(n),$$

where L is the likelihood function, k is the number of parameters, and n is the number of recorded measurements. It was found that the PM 2.5 data fit an IG distribution because the AIC and BIC values for this distribution were smaller than those of the other tested distributions (normal, lognormal, Cauchy, exponential, and Weibull) (Table 3). The summary statistics were n = 31, $\hat{\mu}$ = 53.1229, $\hat{\lambda}$ = 337.9519, and CV = 0.3965. The 95% confidence intervals for the CV of the PM 2.5 dataset by using the five methods are reported in Table 4. These results agree with the simulation results for n = 30 in that the F-HPDCI method performed the best in terms of coverage probability and average length.

Example 2 PM 2.5 data from the Bang Khun Thian district, Bangkok, were collected by the Pollution Control Department, Thailand (http://air4thai.pcd.go.th/webV2/download.php); many factors contribute to PM 2.5 pollution in this area (e.g., road construction, car repair shops, and small and large factories) (Table 5). The summary statistics were n = 31, $\hat{\mu}$ = 56.6161, $\hat{\lambda}$ = 225.1443, and CV = 0.5015. A Q-Q plot of these data for assessing the appropriateness of an IG distribution is shown in Fig. 4.
Furthermore, it was found that the AIC and BIC values for the IG distribution were lower than those of the other tested distributions (Table 6), and thus it provided the best fit for the data. The 95% confidence intervals for the CV of this dataset by using the five methods are reported in Table 7. In agreement with the simulation results for n = 30, the F-HPDCI method provided the best confidence interval performance in terms of coverage probability and average length.
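The model-selection step used in both examples can be illustrated with a short Python sketch that computes AIC = −2 ln L + 2k and BIC = −2 ln L + k ln(n) from a maximized log-likelihood. Only two candidate models are included, the IG and the exponential, because both have closed-form MLEs. The function names and the toy sample are hypothetical, and the data are not the PM 2.5 measurements.

```python
import math

def aic_bic(loglik, k, n):
    """Return (AIC, BIC) for a maximized log-likelihood with k parameters."""
    return -2.0 * loglik + 2.0 * k, -2.0 * loglik + k * math.log(n)

def ig_loglik(x):
    """IG log-likelihood evaluated at the MLEs of mu and lambda."""
    n = len(x)
    mu = sum(x) / n
    lam = n / sum(1.0 / xi - 1.0 / mu for xi in x)
    return sum(0.5 * math.log(lam / (2 * math.pi * xi ** 3))
               - lam * (xi - mu) ** 2 / (2 * mu ** 2 * xi) for xi in x)

def exp_loglik(x):
    """Exponential log-likelihood evaluated at the MLE (rate = 1 / mean)."""
    n = len(x)
    mean = sum(x) / n
    return -n * math.log(mean) - n

# Toy data for illustration only.
toy = [0.6, 1.1, 0.8, 2.3, 1.7, 0.9, 1.4, 0.5, 3.0, 1.2]
aic_ig, bic_ig = aic_bic(ig_loglik(toy), k=2, n=len(toy))
aic_exp, bic_exp = aic_bic(exp_loglik(toy), k=1, n=len(toy))
```

For this toy sample both criteria are smaller for the IG model than for the exponential model, which is the pattern of comparison reported in Tables 3 and 6.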

DISCUSSION
The results show that the AGCI method performed well in all of the scenarios with large sample sizes, as its coverage probabilities were consistently greater than or close to the nominal confidence level while its average lengths were the shortest. In addition, FCI and F-HPDCI produced similar results and performed well for small sample sizes. Moreover, when the sample size and λ increased, the average lengths of all of the methods were reduced. The findings of this work differ from those of previous related studies because we developed a new method for assessing the air pollution level based on fiducial inference derived by using a Gibbs sampler. Moreover, the proposed methods provided the narrowest average lengths, and so can be used to effectively and accurately estimate confidence intervals for various distributions in various fields. Our approach could aid environmentalists and policymakers in monitoring air pollution in specific locations and in giving alarm signals when the air pollution level approaches a dangerous level. The proposed method can also be used in air pollution monitoring systems to mitigate the damage caused by climate change and poor air quality. In the same way, the authorities could leverage our research results to control environmental, social, and health impacts by promoting laws and regulations to ban vehicles with black smoke emissions and/or diesel engines, forbid people from burning rubbish, develop integrated urban planning with emission reduction policies, replace delivery trucks with electric vehicles, and collect environmental taxes or fees according to the "polluter pays" principle.

CONCLUSIONS
In summary, this paper proposed confidence intervals for the CV of an IG distribution constructed by using the GCI, AGCI, BPCI, FCI, and F-HPDCI approaches. The performances of these methods were compared using the coverage probability and the average length via simulation studies. The results show that AGCI and FCI performed the best in situations with large (n = 50, 100, and 200) and small (n = 5, 10, and 30) sample sizes, respectively, and thus, they can be recommended for constructing confidence intervals for the CV of an IG distribution in these two scenarios. Finally, two real pollution datasets were utilized to analyze the performance of the proposed methods in real situations. In future work, carbon monoxide, lead, nitrogen dioxide, ozone, sulfur dioxide, and other criteria pollutants should be examined by using the CV of an IG distribution. Furthermore, new credible intervals based on Bayesian inference for the CV of an IG distribution could be developed.